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ABSTRACT 

Research studies in Educational Data Mining (EDM) often 
involve several variables related to student learning activi- 
ties. As such, it may be necessary to run multiple statisti- 
cal tests simultaneously, thereby leading to the problem of 
multiple comparisons. The Benjamini-Hochberg (BH) pro- 
cedure is commonly used in EDM research to address this 
issue, and it has proven to be a useful method. However, the 
main limitation of the procedure is that it requires the statis- 
tical tests to either be independent or satisfy certain depen- 
dency conditions. The Benjamini-Yekutieli (BY) procedure 
is an alternative that can be applied under arbitrary depen- 
dence assumptions, but this extra flexibility comes with a 
loss of statistical power; hence, the BH procedure is pre- 
ferred whenever it can be properly applied. Based on these 
considerations, in this work we employ simulation studies to 
assess the validity of the BH procedure in two scenarios com- 
mon to EDM research. The first scenario considers the eval- 
uation and comparison of different classification models— 
such an analysis might occur, for instance, during the model 
tuning and validation stage of a study. Then, in the second 
scenario we look at experiments involving the study of state 
transitions in sequential data, examples of which occur in 
affect dynamics research. We find that the BH procedure 
performs as expected when used with simulated classifica- 
tion model predictions; however, when applied to simulated 
sequential data, it does not perform at the expected level. 
Based on these results, as well as previous studies evaluating 
the BH and BY methods, we discuss the appropriate usage 
of these procedures for the scenarios under examination. 
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1. INTRODUCTION 


Consider a statistical analysis that tests several different null 
hypotheses, either on a single data set, or on related data 
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sets. In such a scenario, the probability of making a dis- 
covery—i.e., rejecting a null hypothesis—is higher than in 
an analysis involving a single null hypothesis. Thus, it fol- 
lows that the probability of rejecting a true null hypothesis 
increases as well; such errors are variously called false posi- 
tives, false discoveries, or type I errors. This is known in the 
statistics literature as the multiple comparisons problem. 


Studies in Educational Data Mining (EDM) and related 
fields are shaping the ongoing research and development of 
learning systems that are increasingly becoming part of ev- 
eryday classrooms—thus directly impacting student lives. 
Greater attention is needed to ensure that the conclusions 
drawn from these studies are reliable. Along these lines, 
controlling for multiple comparisons is an important consid- 
eration, as it has been argued that addressing the issue is a 
major factor in ensuring the replicability of scientific results 
[2]. Additionally, many exaggerated or even incorrect re- 
sults can be explained by the testing of multiple hypotheses 
without adjusting for the number of comparisons [34, 40]; 
while this issue commonly occurs with observational data, 
experimental studies are not immune to the problem [30]. 


The main focus of this study is the Benjamini-Hochberg 
(BH) procedure [3], a method that is commonly applied in 
EDM research to control the false discovery rate (FDR)— 
defined as the expected rate of false discoveries among all the 
discoveries made—when multiple statistical tests are used. 
One complication with using the BH procedure is that, in 
order for the theoretical guarantees on its performance to 
hold, the statistical tests must either be independent or sat- 
isfy certain dependency conditions [3, 4]. The Benjamini- 
Yekutieli (BY) procedure is an alternative method that can 
be used under arbitrary dependence assumptions among the 
statistical tests [4]. As the BY procedure is more gener- 
ally applicable than the BH procedure, it is by necessity 
more conservative and thus less likely to classify a result 
as being statistically significant; in turn, this causes it to 
have lower statistical power compared to the BH procedure. 
Thus, the BH procedure is to be preferred over the BY pro- 
cedure whenever it can be properly applied. 


However, the difficulty is that verifying the conditions for 
applying the BH procedure is not always straightforward; 
while some scenarios have been mathematically proven to 
satisfy these conditions, many common examples have not 
been. For instance, as of 2010 the case of pairwise com- 
parisons had not been mathematically proven to satisfy the 
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conditions for using the BH procedure [1], and to the best of 
our knowledge that has not changed in the interim. Because 
it’s not always clear if the conditions for applying the BH 
procedure are satisfied, it is often used without any theo- 
retical guarantees on its performance [15]. In other situa- 
tions, researchers may resort to using both the BH and BY 
procedures and comparing the results [28]. Motivated by 
these issues, in this work we investigate two different sce- 
narios that occur within EDM research, with the goal of 
understanding if the BH procedure is appropriate for each 
situation. In both scenarios, we assume that a researcher 
wants to control the FDR, ideally with the BH procedure, 
but is unsure if it will work as desired. As we are unable to 
provide mathematical proofs for these scenarios, we instead 
turn to simulation studies, a procedure that is commonly 
used to investigate the performance of multiple comparison 
procedures [1, 3, 14, 22, 31, 32, 38, 39]. 


The outline of the paper is as follows. We first discuss the 
specifics of the BH and BY procedures and how to apply 
them when performing multiple hypothesis tests; addition- 
ally, we also look at how multiple comparisons are handled 
in the EDM community by surveying the literature from the 
last five EDM conference proceedings. Then, in the remain- 
der of the paper we evaluate the BH and BY procedures for 
two scenarios that EDM researchers may encounter in their 
work. The first scenario concerns the usage of these proce- 
dures for evaluating and comparing the performance of clas- 
sification models. In this scenario, we make pairwise com- 
parisons of simulated classifiers, using both accuracy and the 
area under the receiver operating characteristic curve (AU- 
ROC) to evaluate their performance; such a situation can 
occur, for example, when trying to find the best performing 
combinations of model algorithms and hyperparameters. 


The next scenario we look at is the analysis of state tran- 
sitions in sequential data. In such an analysis, researchers 
typically run several hypothesis tests to try and determine 
the importance of the various transitions between states. 
Examples of this occur in affect dynamics research, where 
the BH procedure is commonly used [18, 29]. Here, we run 
analyses on simulated sequences of transitions using two dif- 
ferent statistical measures, and we then apply the BH and 
BY procedures and compare the results. Finally, based on 
the results of our numerical experiments, as well as the exist- 
ing literature on controlling the FDR, we discuss the usage 
of the BH and BY procedures in these scenarios. 


2. CONTROLLING FOR MULTIPLE COM- 
PARISONS 
2.1 Benjamini-Hochberg and 


Benjamini- Yekutieli Procedures 

In this study we focus on procedures for controlling the false 
discovery rate (FDR). The FDR was introduced in [3], and 
it has since found widespread use in many scientific fields in- 
cluding education research [38], genetics [31, 35], and medi- 
cal studies [4]. If we let V be the number of false discoveries 
and S be the number of true discoveries, as done in [3] we 
can define the quantity Q as 


ae one ifV+S>0, (1) 
~ 10, otherwise. 


Then, the FDR is equal to E[Q], the expected proportion of 
false discoveries among all the discoveries made. 


The family-wise error rate (FWER), which is defined as the 
probability of making at least one false discovery when per- 
forming a set of hypothesis tests, is another measure com- 
monly associated with the problem of multiple comparisons. 
Although the Bonferroni correction is probably the most 
famous procedure used to control the FWER, there exist 
many other alternatives. However, while such procedures 
can be useful in situations in which a false discovery must 
be avoided at all costs, such as clinical trials of new medical 
treatments [16], the downside to these methods is a loss of 
statistical power, resulting in an increased likelihood of miss- 
ing true discoveries. While procedures for controlling the 
FWER are concerned with the occurrence of any false dis- 
coveries, FDR controlling procedures are slightly more per- 
missive, as they allow a certain proportion of the discoveries 
to be false. Thus, the advantage of FDR controlling pro- 
cedures is that they typically have greater statistical power 
and, as such, a better chance of correctly identifying true dis- 
coveries; the resulting trade-off is that false discoveries are 
more likely. However, this trade-off can be beneficial when 
a large number of hypothesis tests are being conducted,’ or 
when the research is of a slightly more exploratory nature. 


In addition to introducing the FDR to the scientific litera- 
ture, the authors in [3] also outlined the BH procedure. As 
shown there, the BH procedure is mathematically proven 
to control the FDR at a given level when the statistical 
tests—or, equivalently, the test statistics—are independent. 
However, in many practical applications the statistical tests 
may have some underlying dependence between them. With 
these situations in mind, further important work on control- 
ling the FDR appeared in [4], where the authors proved that, 
in addition to the independent case, the BH procedure is 
valid under certain dependency conditions between the sta- 
tistical tests. Among other scenarios, it was shown that the 
BH procedure properly controls the FDR with multivariate 
normal test statistics having nonnegative correlations. Ad- 
ditionally, the authors in [4] introduced the BY procedure for 
situations in which the BH procedure is not valid, and they 
proved that the BY procedure controls the FDR regardless 
of the dependence between the tests. 


In the remainder of this section we discuss the application 
of the BH and BY procedures. Consider a statistical analy- 
sis that involves the testing of m null hypotheses. Of these 
null hypotheses, mo < m are true null hypotheses—these 
correspond to the hypotheses that we expect a statistical 
test to classify as not being significant—while the remaining 
m—mo hypotheses are the false null hypotheses. Note that, 
in practice, mo is an unknown value. Let Pi,..., Pm be the 
p-values for the m statistical tests, with these values being 
listed in ascending order; the corresponding null hypothe- 
ses are then represented by H1,..., Hm. The relationships 


‘As a relatively extreme example, statistical analyses in ge- 
netics research can involve thousands of hypothesis tests, 
and in such cases FWER controlling procedures can be 
overly restrictive [1]. 
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between these various terms can be summarized as follows. 


Not significant 
True null U 
False null T 


e m= total number of hypotheses being tested 
@ mo = number of true null hypotheses 


e V = number of false positives (i.e., false discoveries or 
type I errors) 


e S = number of true positives 
e T = number of false negatives (i.e., type II errors) 


e U = number of true negatives 


Let q represent our chosen threshold—or, level—for control- 
ling the FDR, and define FDRmax = mo 9. If the statistical 
tests are independent, or if they satisfy certain dependency 
conditions, it was shown in [4] that the FDR resulting from 
an application of the BH procedure is at most FDRmax. Such 
an application works as follows. Assuming once again that 
the p-values are in ascending order, we find the largest in- 
teger k such that Py < £q. Then, we simply reject all the 
null hypotheses H; for which i < k. 


Next, as the BY procedure controls the FDR under arbitrary 
dependence assumptions, it is necessarily more conservative 
when rejecting a null hypothesis. This takes the form of a 
lower threshold for the upper bound used to determine the 
“significance” of the p-values. Specifically, we find the largest 
integer k such that Pp < sare where c(m) = S77", ;. 
Using this procedure, it was shown in [4] that the resulting 
FDR is bounded above by FDRmax = mo g, regardless of the 


type of dependence between the statistical tests. 


To see how these procedures work, we next look at an exam- 
ple. Suppose we run 10 separate statistical tests (m = 10) 
that return the following p-values. 


0.002, 0.008, 0.011, 0.013, 0.023, 
0.028, 0.092, 0.214, 0.647, 0.853 


Next, we compare these p-values to the formulas used for 
the BH and BY thresholds, using a value of g = 0.1; for 


added context, we also include the results for the Bonferroni 
correction. For each method, the thresholds that correspond 
to statistically significant p-values are in bold. 

k P. Pu B Bonterionl 

mad ae: qd md 

1 0.002 |} 0.01 0.003 0.01 

2 0.008 || 0.02 0.007 0.01 

3 0.011 || 0.03 0.010 0.01 

4 0.013 |} 0.04 0.014 0.01 

5 0.023 || 0.05 0.017 0.01 

6 0.028 || 0.06 0.020 0.01 

7 0.092 || 0.07 0.024 0.01 

8 0.214 || 0.08 0.027 0.01 

9 0.647 || 0.09 0.031 0.01 

10 0.853 |] 0.1 0.034 0.01 


For the BH procedure, we can see that k = 6 is the largest 
value for which Py < ~4, as we have 0.028 < 0.06. Thus, 
the BH procedure, using a value of 0.1, would reject the 
null hypothesis for the statistical tests corresponding to the 
lowest six p-values. Next, for the BY procedure we see that 
k = 4 is the largest value for which P, is less than the cor- 
responding threshold; in this case, we have 0.013 < 0.014. 
It’s worth noting that, in this example, even though both P2 
and P3 are not below the corresponding thresholds, the BY 
procedure still classifies them as being statistically signifi- 
cant. This is a feature of FDR controlling procedures that, 
in many cases, allows them to be more permissive than pro- 
cedures for controlling the FWER. 


2.2 Applications in EDM Research 


To understand how EDM research is controlling for multi- 
ple comparisons, we reviewed EDM conference proceedings 
from the last five years (2016-2020). We found that, among 
the 22 papers that reported controlling for multiple com- 
parisons,” half used the Bonferroni correction and half used 
the BH procedure, with no studies using the BY procedure. 
Based on the method used to perform the comparisons, the 
studies can be partitioned as follows: group comparison (8), 
pairwise comparison (8; including pairwise model compari- 
son), correlation (4), and regression analysis (2). The studies 
involving group comparisons used statistical methods such 
as the Mann-Whitney U test, chi-squared test, t-test, and 
ANOVA. The studies employing pairwise comparisons used 
methods such as the Kruskal-Wallis test, Mann-Whitney U 
test, McNemar’s test, chi-squared test, and t-test. Overall, 
these 22 studies investigated diverse educational constructs 
in virtual learning environments including affect, student 
behavior in MOOCs, help-seeking, collaboration, and self- 
regulation. 


The choice between the Bonferroni correction and the BH 
procedure varied in the studies, as the selection was not com- 
pletely determined by the study methodology. For instance, 
an exploratory study used the more conservative Bonferroni 
method for a correlational analysis [61], while an experimen- 
tal study with group comparisons used the less conservative 
BH procedure [46]. For EDM research, selecting between the 
Bonferroni correction and the BH procedure may not be uni- 
versal and likely depends on the context of the study. As an 
example, consider that an analysis examining student demo- 
graphic differences on an important educational construct— 
such as self-efficacy, affect, or learning—likely has fewer data 
samples from underrepresented minorities [20]. In such a 
case, penalizing the statistical power with a more conser- 
vative method like the Bonferroni correction may lead to 
missed opportunities for critical discoveries related to eq- 
uity. On the other hand, contrast this with the evaluation 
of an expensive and large-scale educational technology inter- 
vention in a public school system; given the costs involved, 
both financially and otherwise, it could be argued that such 
an evaluation requires a more conservative approach to con- 
trol for false discoveries. 


More broadly, EDM research may not always involve large 
data sets. This is particularly true for educational constructs 
that require resource-intensive data collection procedures— 


2See Section 8 for the full list of references. 


Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 35 


e.g., training coders, gathering approvals, and conducting 
classroom studies. Hence, using the Bonferroni correction 
to control for multiple comparisons at the expense of losing 
statistical power may not always be affordable. In contrast, 
using the BH procedure in scenarios that violate its sta- 
tistical assumptions may lead to invalid conclusions. Our 
review of EDM studies from the last five years also revealed 
that the field may not be taking advantage of the BY proce- 
dure, especially in scenarios where it is difficult to meet the 
assumptions of the BH procedure. These observations are 
what motivated us to investigate the use of the BH and BY 
procedures in research settings relevant to EDM. 


3. METHODS 


In this section we outline the general procedure we follow 
for our simulation studies. Since evaluating multiple com- 
parison procedures requires knowledge of whether a null hy- 
pothesis is true, and as this isn’t typically known with real 
data, simulations are commonly used for such analyses. In 
all of our experiments, we begin by generating simulated 
data according to a given probability distribution. While 
the specifics of this procedure vary for the two scenarios we 
consider, the common thread is that this must be done in a 
way as to have control over whether or not each null hypoth- 
esis is true. For example, in our comparisons of simulated 
classification models, the performance of each model is con- 
trolled by a single parameter; thus, when this parameter 
differs for two models, the null hypothesis that the models 
perform equally well is false. 


Another important detail is that, as we are focusing on two 
particular scenarios, we can generate simulated data specific 
to these scenarios. That is, for the model comparison exper- 
iments we simulate both the classifier predictions and the 
ground truth labels; then, for the state transition analysis 
we generate simulated sequences of states. By simulating 
the underlying data for each scenario, we are attempting to 
evaluate the BH and BY procedures in conditions that are 
as realistic as possible. In comparison, other studies that 
are more general in nature may simulate the distribution of 
the test statistics, rather than the underlying data, when 
evaluating multiple comparison procedures. 


After generating the data for a simulation run, we perform 
our statistical tests and compute the corresponding p-values. 
Once this is done, we then apply the BH and BY procedures 
for various threshold values q—specifically, we use 0.05, 0.1, 
and 0.15 in all our evaluations. While a value of 0.05 is 
commonly used, it’s been argued that this threshold may 
be too low for some applications [26]; thus, we evaluate a 
range of values in our simulations. Based on the statisti- 
cal significance results from our application of the BH and 
BY procedures, we can compute Q, the proportion of false 
discoveries among all the discoveries made, using (1). To 
obtain our estimate of the FDR, we then compute the av- 
erage of Q over a total of 10,000 simulation runs. For the 
various values of g, we compare these FDR estimates to the 
values of FDRmax as defined in Section 2.1. 


At this point, it’s worth mentioning that the value of Q— 
and, hence, the estimated FDR value—can be very different 


from the false positive rate.° Using the notation in (2), 
the false positive rate can be written as v0: In compar- 


ison, Q is computed with the formula ee which has a 
different denominator. Thus, while the FDR is the expected 
proportion of false discoveries among all the rejected null 
hypotheses, the false positive rate is the (expected) propor- 
tion of false discoveries among all the true null hypothe- 
ses. Consider the following example. Assume we are testing 
20 total hypotheses, all of which are true null hypotheses 
(mo = m = 20). Furthermore, assume that one false posi- 
tive is recorded. Then, the false positive rate for this set of 


tests would be equal to THis = 0.05. However, applying (1) 


gives a value of Q = ee = 1. This discrepancy is some- 
thing to keep in mind as we analyze the results from our 


simulation studies in subsequent sections. 


4. MODEL COMPARISONS 


The first scenario we study concerns the comparison of sev- 
eral classification models on a fixed set of test or validation 
data. A common example of this occurs during the model 
building process, where it may be necessary to evaluate the 
performance of many different combinations of classification 
models and hyperparameters. In such a case, it can be help- 
ful for the researcher to run statistical tests to more precisely 
quantify the differences in performance. To that end, we 
focus on the pairwise comparisons of the classifiers, where 
we assume that the classifiers could have different underly- 
ing algorithms—e.g., logistic regression vs. random forest— 
or the same algorithm with different hyperparameters. We 
evaluate each pair of classifiers by looking at both the accu- 
racy and the area under the receiver operating characteristic 
curve (AUROC). To measure the possible difference between 
the accuracy values of the models, we use McNemar’s test 
[13, 27]. When conducting pairwise comparisons of classi- 
fier accuracy on a fixed set of test data—as opposed to a 
procedure such as k-fold cross-validation, where the test set 
varies—using McNemar’s test is recommended [10]; for these 
evaluations we use the implementation in the statsmodels 
[33] Python library. Then, to compare the AUROC values 
we use DeLong’s test [9], a method developed to statistically 
test for differences in AUROC values; specifically, we apply 
the fast version of the algorithm outlined in [36].* 


Our simulations use the following procedure. We assume 
that we are evaluating the performance of a binary classifier 
on a test set containing n data points; for these simulations 
we use n-values of 500, 1000, and 5000. For each value of n, 
we sample n numbers uniformly at random from 0.01 to 0.99; 
we refer to this set of numbers as U,. In each simulation 
run, the numbers in U, are used to generate the labels for 
our data using the following procedure. Let 7 be an integer 
from 1 to n, and let pi € Un. With probability p; we assign 
a label of 1 to y;; otherwise, with probability 1—p, it is then 
given a label of 0. Note that the set Un is generated once 
for each value of n, and this same set is then used repeatedly 
for all of our simulation runs with a test set of size n. 


° That is, while “false discovery” and “false positive” are used 
interchangeably, the terms “false discovery rate” and “false 
positive rate” have different definitions. 

“The code for our implementation of the algorithm in [36], 
as well as for running all of our experiments, is available at 
https://github.com/jmatayoshi/multiple-comparisons. 
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Table 1: Accuracy and AUROC values for an example sim- 
ulation run using a test set of size n = 1000. 


o 0.1 0.1 0.1 0.5 1 2 


Accuracy || 0.733 0.724 0.732 0.706 0.651 0.606 
AUROC || 0.824 0.821 0.824 0.787 0.721 0.655 


We next describe our procedure for simulating the classi- 
fier predictions. Let ci; represent the predicted probability 
given by classifier 7 for the i-th data point in our test set. To 
generate cj;, we begin by converting p; € Un to a z-score. 
Then, to add noise to the classifier’s prediction we randomly 
sample a value, s;;, from a normal distribution with mean 0 
and standard deviation o;, add this to the z-score, and then 
convert everything back to a probability; the resulting value 
is cij. The size of 0; controls the performance of the classi- 
fier, with lower values giving predicted probabilities that are 
less noisy and more likely to align with the class labels. Let 
® denote the cumulative distribution function (CDF) of the 
standard normal distribution. Our procedure for generating 
the classifier predictions can be summarized as follows. 


1. a= &~'(p;) 
2. Draw sample value si; from NV (0, 03) 
3. Cig = O(z: + Si;) 


To get an idea of the effect of different values of o on the per- 
formance of our simulated classifier predictions, in Table 1 
we show the accuracy and AUROC values from one simu- 
lation run, using different values of o and a test set size of 
n = 1000. The three classifiers with o values of 0.1 have the 
best performance, with accuracy values from 0.72 to 0.73 
and AUROC values around 0.82. The other classifiers, to 
varying degrees, perform worse, with the lowest accuracy 
and AUROC values at roughly 0.61 and 0.66, respectively. 
Our initial analysis simulates the pairwise comparison of six 
different classification models, where all the classifiers are 
assumed to perform equally; specifically, we use a value of 
o = 0.5 for each model. Using our previously described 
procedure, we generate a total of 10,000 simulation runs for 
each value of n. Our experimental setup results in (8) =15 
pairwise comparisons (m = 15), and as there are no underly- 
ing differences between the simulated classifiers, we have 15 
true null hypotheses (mo = 15). As such, if the conditions 
for the BH procedure are satisfied, we expect the FDR to 
be less than FDRmax = 7249 =q. The results are shown in 
Figures 1 and 2, where we display the estimated FDR rates 
for the BH and BY procedures, for different combinations 
of test set sizes and values of g. Using both McNemar’s test 
and DeLong’s test, the BH procedure appears to control the 
FDR by keeping it below the corresponding FDRmax value, 
shown by the dashed line, in all cases—that is, for all com- 
binations of test set sizes and q. In comparison, the BY 
procedure is much more conservative, with each estimated 
FDR value far below the FDRmax line. 


For our second set of simulations, we use the values of o 
that appear in Table 1 to generate six different models. As 
there are three models with the same value of 0 = 0.1, we 
have (3) = 3 true null hypotheses (mo = 3) out of 15 total 
comparisons (m = 15). Thus, under the appropriate condi- 
tions the BH procedure should keep the FDR at or below 


FDRmax = aq = zq- The results are given in Figures 3 and 
4, where we can see that the estimated FDR values using the 
BH procedure are at or below the value of FDRmax, given 
by the dashed line, in all cases—that is, for all combinations 
of test set sizes and q. As before, the estimated FDR values 
from the BY procedure are very low, with each value again 
appearing far below the corresponding FDRmax line. 


These results are seemingly consistent with previous works 
analyzing the performance of the BH procedure with pair- 
wise comparisons [21, 38]. The findings from several of these 
studies are summarized in [39], where the author states that 
in “all the studies, for all configurations of true and false hy- 
potheses simulated, for balanced and for non-balanced de- 
signs, normal and non-normal distributions, the BH proce- 
dure controlled the FDR.” Thus, combining these previous 
results with our experiments from this section, there appears 
to be good evidence that the BH procedure properly controls 
the FDR in the case of pairwise comparisons of classification 
models. We return to this subject in the discussion. 


5. TRANSITIONS IN SEQUENTIAL DATA 


In our second scenario we look at data that are sequential 
in nature, as examples of such data appear in many areas of 
educational research. One particular focus with sequential 
data is the analysis of transitions between different states— 
or events—in these sequences. Researchers are often inter- 
ested in understanding if transitions between certain pairs 
of states are significant, either because they happen often or 
because they rarely appear. Typically in such cases, many 
pairs of states are evaluated with statistical tests, thus neces- 
sitating a correction for multiple comparisons. For example, 
past studies have analyzed logs of student actions in learn- 
ing systems, in an attempt to understand the differences 
between productive and unproductive transitions between 
activities within these systems [5, 6]. Another example is 
affect dynamics research, which studies sequences of student 
affective states, with the goal of understanding how students 
transition between these different states. Previous works in 
this area have used the BH procedure to control the FDR 
[18, 29], and as such the goal of our next analysis is to in- 
vestigate the appropriateness of using this procedure when 
analyzing state transitions. 


5.1 Experimental Setup 

Our numerical experiments for sequential data evaluate the 
BH and BY procedures on simulated sequences of states. 
Each of these sequences could represent, for example, a stu- 
dent’s affective states while working in a learning system. 
The states are randomly sampled according to the proba- 
bility distribution given in Table 2; each entry in the table 
gives the probability of sampling the next state (column) 
based on the value of the previous state (row). For exam- 
ple, suppose that C is the previous state. In this case, A 
has a probability of 0.2 of being the next state, B has a 
probability of 0.2 — 7 of being the next state, and so on. 


For our simulations, we use two different values for y: 0, 
which results in all 25 hypotheses being true null hypothe- 
ses; and 0.05, which results in 21 true null hypotheses, out of 
the 25. For each value of 7, we generate n sequences consist- 
ing of 20 states each. To generate these sequences, the first 
state in each sequence is sampled randomly from the five 
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Figure 1: Comparison of the estimated FDR for the BH and BY procedures, using McNemar’s test and six classifiers with 
the same value of o = 0.5. Vertical lines represent the 99% confidence interval for each estimated FDR value. 
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Figure 2: Comparison of the estimated FDR for the BH and BY procedures, using DeLong’s test and six classifiers with the 
same value of o = 0.5. Vertical lines represent the 99% confidence interval for each estimated FDR value. 
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Figure 3: Comparison of the estimated FDR for the BH and BY procedures, using McNemar’s test and the o values in Table 1. 
Vertical lines represent the 99% confidence interval for each estimated FDR value. 
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Figure 4: Comparison of the estimated FDR for the BH and BY procedures, using DeLong’s test and the o values in Table 1. 
Vertical lines represent the 99% confidence interval for each estimated FDR value. 


Table 2: Probability distribution used to generate the sim- 
ulated sequences of states. Each entry represents the prob- 
ability of making a transition to the next state (column), 
given the previous state (row). 


4 £: a Pp # 
prev 
A 0.2 O02+7 0.2 02-y 0.2 
B 0.2 0.2 0.2 0.2 0.2 
C 0.2 02-y 02 O02+7 0.2 
D 0.2 0.2 0.2 0.2 0.2 
E 0.2 0.2 0.2 0.2 0.2 
Table 3: Marginal model coefficient p-values from one sim- 


ulation run using y = 0.05. With a threshold of g = 0.05, 
both the BH and BY procedures give the same statistical 
significance results for this example; namely, only the four 
transition pairs with sample probabilities modified by y are 
statistically significant. 


next 


pies A B C D E 
A 0.252 0.000 0.335 0.000 0.703 
B 0.496 0.365 0.327 0.864 0.252 
C 0.035 0.000 0.527 0.000 0.569 
D 0.260 0.652 0.080 0.980 0.889 
E 0.581 0.099 0.800 0.869 0.179 


choices, and then all subsequent states are sampled accord- 
ing to the probability distribution in Table 2. For each set 
of n sequences we evaluate our statistical tests (described in 
Sections 5.2 and 5.3) and then compute the resulting value 
for Q; this constitutes one simulation run. We then perform 
10,000 simulation runs for each value of n in order to obtain 
an estimate of the true FDR. For this analysis, we use the 
following values of n: 50, 100, and 200. 


The L statistic, originally introduced in [12], is intended to 
be used as a measure of the significance of different pairs 
of transitions, and it has been widely applied in the study 
of affect dynamics [11, 12, 18]. Given two states A and B, 
it measures the likelihood of transitions from A to B while 
taking into account the overall frequency at which B occurs. 
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However, several recent works have revealed issues with the 
use of the L statistic for the analysis of state transitions 
[7, 18, 19]. Thus, for our simulations we use two newer 
methods that have been developed in response to the prob- 
lems with the L statistic. First, in Section 5.2 we look at the 
performance of the BH procedure when used in combination 
with the marginal model approach outlined in [25]. Then, 
in Section 5.3 we evaluate the BH procedure when it is used 
with the modified version of the L statistic from [24]. 


5.2 Marginal Model 

To estimate the influence that starting in state A has on the 
probability of making a transition to B, in this section we use 
the marginal model regression procedure from [25]. In this 
approach, the regression model has a binary response vari- 
able, where the value of this variable is one if the next state 
is equal to B, and it is zero otherwise. Based on the binary 
response variable, we use the logit as our link function. Our 
predictor—or, independent—variable is also binary, with a 
value of one if the previous state is equal to A and zero 
otherwise. We can summarize this procedure as follows. 


e y= yit: one if B is the next state for student 7 at time 
t; zero otherwise 


e x =x: one if A is the previous state for student 7 at 
time t; zero otherwise 


Letting S represent the standard logistic function, the re- 
gression equation then has the form 


1 


P(yie = 1] vie) = S(Bo + Brie) 1 + e—(BotBizit)” 


(3) 


When xz = 1 the regression model returns an estimate for 
P(B|A), the probability of a transition to B, given that 
the starting state is A. Then, when x = 0 it returns an 
estimate for P(B| A), the probability of a transition to B, 
given that the starting state is not A. Thus, to measure 
the importance of starting in state A, we focus on testing 
if the value of (6, is significantly different from zero. This 
is done using a two-tailed z-test on the value of 8; for each 
individual fit of the regression model. 


Finally, as the sequential data used in these analyses typi- 
cally take the form of repeated measurements on a student, 
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the result is a set of dependent—or correlated—data. To 
account for this dependence, as outlined in [25] we use a 
marginal model, based on generalized estimating equations 
(GEE) [17, 23], to estimate the logistic regression coeffi- 
cients; in particular, we use the GEE implementation from 
the statsmodels Python library. 


As before, let m denote the total number of statistical tests, 
with mo < m representing the number of true null hypothe- 
ses. Using the BH procedure with a value of y = 0, we 
have mo = m; as such, we would expect the FDR to be less 
than FDRmax = q = q if the BH conditions are satisfied. 
Then, for all values of 7 > 0 we would expect the FDR to be 
less than FDRmax = 4, assuming the BH conditions are 
satisfied, as mo = 21 of the tests are true null hypotheses. 


The first set of results, using a value of 7 = 0, is shown in 
Figure 5. Here, we can see that in all cases the estimated 
FDR values from the BH procedure are above the theoret- 
ical upper bound of FDRmax, shown by the dashed line. 
The gap is particularly notable with smaller numbers of se- 
quences. On the other hand, the BY procedure offers much 
more stringent control of the FDR, with all of the estimated 
values appearing below the FDRmax line. Figure 6 then 
shows the results from using a value of y = 0.05. Overall, 
the picture appears similar to the y = 0 case, with the esti- 
mated FDR values from the BH procedure always appearing 
above the FDRmax line, and with the difference again being 
more pronounced with smaller numbers of sequences. 


5.3. Removing Self-Transitions 

Our final set of experiments investigates a specific situa- 
tion in sequential data analysis that occurs when researchers 
want to remove the influence of repeated states. To do this, 
many researchers in the affect dynamics community remove 
self-transitions—i.e., transitions where the same state is re- 
peated for more than one step—before analyzing the data 
[18]. However, this procedure has been shown to overesti- 
mate the significance of transitions when used with the L 
statistic [19]. Thus, for this analysis we instead use a modi- 
fied version of the L statistic, named L* [24]. 


DEFINITION 1. Let A and B be two states, and let 
Tz = {transitions where the next state is not A}. (4) 
Then, we define 


B|A,Tz) — P(B| Tz) 


* wl 


(5) 


where P(B|A,T z) is the probability of a transition to B in 
Tx, given that the starting state is A, while P(B|T z) is the 
overall probability of a transition to B in Tz. 


The base rate of the state B, given by P(B|T x) in (5), 
can be computed either individually for each sequence, or 
averaged over the entire set of sequences. For the computa- 
tions in the remainder of this work, we compute these rates 
individually per sequence. 


Our analysis using L* applies the statistic to the sequences 
from our experiments in Section 5.2. Specifically, we take 


each sequence and, for each pair of transition states, com- 
pute (5). To test for statistical significance, we follow the 
procedure outlined in [24] and apply a two-tailed t-test to 
the L* values. The results for the y = 0 and y = 0.05 se- 
quences are shown in Figures 7 and 8, respectively. While 
perhaps not quite as prominent as with the marginal model 
procedure, there are several examples where the estimated 
FDR values from the BH procedure are clearly above the 
FDRmax line. As with the marginal model procedure, the 
worst cases occur with the smallest number of sequences. 


5.4 Dependence of the Statistical Tests 

The experiments in this section provide evidence that, when 
used in combination with either the marginal model proce- 
dure or L*, the BH procedure does not always control the 
FDR at the desired level; in turn, this may indicate that the 
conditions for applying the BH procedure are not satisfied. 
In the remainder of this section, we outline two arguments 
that show the assumption of independence is violated be- 
tween the statistical tests used in these analyses. Note that 
these are not rigorous mathematical proofs; rather, our goal 
here is to simply give some intuition into the relationships 
between the statistical tests. 


Consider a set of sequential data consisting of possible states 
A, B, C, D, and E. For states A and B, let 64,5 represent 
the value of 8; in (3) for transitions of the form A > B. 
Suppose that the following inequalities hold. 


Ba,a >0 
Ba,c > 0 


Ba,Bp > 0 
Bap >0O 


(6) 


Consider, for example, 64,2. The corresponding marginal 
model estimates the probability of a transition to B, de- 
pending on whether or not the starting state is A—these es- 
timates correspond to P(B|A) and P(B| A), respectively. 
The inequalities in (6) can then be interpreted as follows. 


P(A| A) > P(A| A) 
P(C| A) > P(C|A) 


P(B|A) > P(B|A) 
P(D|A) > P(D|A) 
Next, consider the following two equalities. 

P(E | A) =1-— P(A| A) — P(B|A) — P(C| A) — P( 
P(E|A)=1-—P(A| A) — P(B| A) — P(C| A) — P( 


D|A) 
D|A) 
(8) 


Combining (7) and (8), it follows that P(E| A) < P(E| A), 
or, equivalently, that Baz < 0. What this argument il- 
lustrates is that it’s not possible—or, at least, it’s highly 
unlikely—for 8,4,r to be positive when the other four coeffi- 
cients are positive, which means that the corresponding sta- 
tistical tests are not completely independent of each other. 


Next, suppose we are in the situation of removing self-transitions 


and applying L*; thus, in what follows assume we are inter- 
ested in transitions from A to B and that, following (4) in 
Definition 1, all transitions to A have been removed from 
our sequence. Suppose the following inequalities hold. 
P(B| A) > P(B) 
P(C| A) > P(C) (9) 
P(D|A)> P(D) 
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Figure 5: Comparison of the estimated FDR for the BH and BY procedures, using a value of 7 = 0 and the marginal model 
method. Vertical lines represent the 99% confidence interval for each estimated FDR. value. 
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Figure 6: Comparison of the estimated FDR for the BH and BY procedures, using a value of 7 = 0.05 and the marginal model 
method. Vertical lines represent the 99% confidence interval for each estimated FDR. value. 


Consider the equalities 


P(E| A) =1- P(B| A) — P(C| A) — P(D|A) 


P(E) =1—- P(B)- P(C) - P(D), ae) 


where we’re using the fact that, as we are removing transi- 
tions to A, P(A| A) = 0 and P(A) = 0. Combining (9) and 
(10), it follows that P(E| A) < P(£). Thus, it’s not possi- 
ble for all four of the conditional probabilities to be larger 
than the base probabilities—in turn, this means that at least 
one of the L* values must be negative. As such, it follows 
that the corresponding statistical tests are not completely 
independent of each other. 


6. DISCUSSION 


In this paper, we investigated the validity of methods used to 
adjust for false discoveries when performing multiple com- 
parisons. In two scenarios relevant to EDM research, we 
evaluated the performance of the commonly used BH proce- 
dure in relation to an alternate method—the BY procedure— 
that is more general and is valid to use when the assump- 
tions of the BH procedure cannot be met. Our first set 
of experiments looked at the performance of these proce- 
dures when used with pairwise comparisons of classification 
models on a fixed set of test data. In all our experiments, 
using both accuracy and AUROC as our performance met- 


rics, the BH procedure controlled the FDR at the expected 
level. These results are consistent with previous studies in- 
vestigating pairwise comparisons, where in all cases the BH 
procedure properly controlled the FDR [21, 38, 39]. Com- 
bining these previous results with the experiments in this 
study, our current view is that the usage of the BH pro- 
cedures appears justified in this scenario—that is, one can 
reasonably expect the BH procedure to properly control the 
FDR when performing pairwise comparisons of classifiers on 
a fixed set of test data. 


Contrast this with our investigation on sequential data, where 
we observed that the BH procedure, when combined with 
either the marginal model procedure or L*, did not control 
the FDR at the expected level—this happened with various 
experimental conditions and for various threshold values q. 
The results could be an indication that the theoretical condi- 
tions for applying the BH procedure might not be satisfied in 
these situations. Combined with the fact that various issues 
involving the analysis of state transitions have recently come 
to light [7, 18, 19, 24, 25], we believe that using the more 
conservative BY procedure is justified, particularly when the 
analysis involves a small number of sequences. To compen- 
sate for the fact that it is more conservative, when applying 
the BY procedure we suggest the use of a larger value of q, 
such as 0.1. 
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Figure 7: Comparison of the estimated FDR for the BH and BY procedures, using a value of y = 0 and the L” statistic. 
Vertical lines represent the 99% confidence interval for each estimated FDR value. 
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Figure 8: Comparison of the estimated FDR for the BH and BY procedures, using a value of y = 0.05 and the L* statistic. 
Vertical lines represent the 99% confidence interval for each estimated FDR value. 


More generally, it’s worth noting that there are many ex- 
amples where the BH procedure performs well without any 
theoretical guarantees [14, 22]. Thus, for situations in which 
the BH procedure has not been theoretically or empirically 
vetted, we offer a couple of suggestions. First, whenever pos- 
sible, conducting a simulation study may be helpful; as seen 
in this work, the results could give evidence for or against 
the usage of the BH procedure. Failing that, and if there 
is good reason to doubt the validity of using the BH proce- 
dure, we suggest that the BY procedure be considered as a 
possible alternative. In these cases, a higher value for q may 
be justified in order to compensate for the more restrictive 
nature of the BY procedure, and this decision could be made 
based on the context of the study. For instance, in studies 
that are exploratory in nature or have small sample sizes, 
the loss of statistical power might be a larger concern; thus, 
the BY procedure using a threshold of 0.1 or larger may 
be appropriate. Whereas, in an experimental study looking 
for conclusive evidence, it may be preferable to use the BY 
procedure with a smaller value of q. 


In regards to future work in this area, it would be of interest 
to more completely understand why the BH procedure fails 
to properly control the FDR in our simulations with sequen- 
tial data. While we presented an argument in Section 5.4 
that showed the statistical tests are not independent, it’s 


an open question whether this argument can be extended to 
rigorously show that the assumptions of the BH procedure 
are violated—we are currently looking at this in more de- 
tail. Furthermore, it’s possible that other elements may also 
be at play. For example, as discussed previously there are 
known issues with several existing methods commonly used 
to evaluate state transitions. While the methods we used 
in this study were originally developed in response to these 
problems [24, 25], it’s possible that these existing issues, or 
perhaps even new ones, are a factor; thus, further adjust- 
ments to the marginal model and L* methods could lead to 
improved control of the FDR with the BH procedure. 


There exist other directions for future work that we are cur- 
rently exploring. First, as the literature on multiple com- 
parisons and controlling the FDR is actively growing, many 
methods have been developed over the years. Thus, while 
the BH and BY procedures are arguably the most notable of 
the FDR controlling procedures, it would be worthwhile to 
evaluate some of the newer alternatives, especially for the 
analysis of state transitions. Second, our analyses in this 
work focused exclusively on false discoveries (Type I errors) 
and did not consider false negatives (Type II errors). As 
such, in future work we aim to explicitly examine the inter- 
action between these two types of errors with respect to the 
BH and BY procedures and EDM research. 
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